This report explores a data set containing different attributes and quality scores of 1599 red wines.

The data set source was created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

It can be found at: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Univariate Plots Section

## [1] 1599   13

Our data set contains 1599 observations, and has 13 variables.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] 0

There are no NA values in our wine data set.

##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density             pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000

The two things that stick out the most are that there are 0 values for citric acid, and the huge range of total sulfur dioxide.

Let’s explore these discrepancies as we take a look at the graphs of each variable.

Our data is in tidy form and is complete. My overall objective is to find out which variables most affect the quality rating for each wine, so let’s begin by taking a look at how many wines we have at each quality level.

Nearly all wines are average. No wines scored less than 3 or more than 8 so none were terrible or exceptional. In addition, the majority of wines received a 5 or 6 for quality. Will this lack of differentiation allow us to find any insights?

##
## (2,3] (3,4] (4,5] (5,6] (6,7] (7,8]
##    10    53   681   638   199    18

We see the number of wines at each level of quality. We see that 1319 of the 1599 wines received an average score of 5 or 6. That’s over 82% of the wines.

Let’s take a look at the other variable counts beginning with alcohol as I suspect it will have the largest affect on quality ratings.

There are a few wines that give alcohol percentage to more than one decimal points. Let’s set a bin width to make the graph easier to read.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    8.40    9.50   10.20   10.42   11.10   14.90

The lowest alcohol percentage is 8.4 and the highest is 14.9. The percentage tops at 9.5% and there are fewer wines as alcohol percentage increases. I am curious how quality relates to alcohol percentage as well as how different combinations of alcohol and the other variables relate to quality. For example, does high alcohol and high sugar score better than high alcohol and low sugar or vice versa?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity is pretty much normal with a few outliers. It peaks around 7 units, with some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most of the wines have a volatile acidity between 0.2 and 1. Let’s look a little bit closer at this interval and increase our bin width slightly to better observe the peaks.

We have some peaks at 0.4, 0.5, and 0.6. Could this be that wines look to get those exact levels or just that some wines round their amounts differently?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.090   0.260   0.271   0.420   1.000

This is a strange distribution, but very little in terms of outliers.

Increasing binwidth and taking the log, we see a nearly uniform distribution that tapers down after 0.5. The noise from the original graph can likely be attributed to binwidth.

While there is a disproportionate amount of 0 values, they don’t seem too out of place. Also, because citric acid has such a strong flavor, it is used less frequently than other types of acid. This could help explain why many wines don’t contain any of it.

A wine’s acidic taste profile is determined by its total acidity. Let’s create the total acidity variable by combining fixed and volatile acidity. Citric acid is included in fixed acids, so won’t be added to total acidity.

We will use this variable later in the analysis to compare different flavor profiles in our wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.900   1.900   2.200   2.539   2.600  15.500

Almost all wines have sugar levels between 1 and 3, but there are some major outliers with a max of 15.5. Why do some wines have so much sugar? How does it affect the other variables especially quality? Let’s take a closer look at this and adjust bin width.

A closed look at the bulk of wines shows a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

This graph looks quite a lot like the sugar graph, normal at a certain interval but with a lot of outliers. This is demonstrated by the huge difference between Q3 and the max.

When zooming in closer and changing the bin width, it is much easier to see the normality of the graph. Although there are a lot of outliers, the levels are still very low at just over 0.6 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   14.00   15.87   21.00   72.00

The graph for free sulfur dioxide is skewed right. Let’s see if we can glean any information from the log of this graph.

Outside of a few outliers, there is nothing unusual about this graph. Nearly all wines contain fewer than 40 units of free sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    6.00   22.00   38.00   46.47   62.00  289.00

Earlier, it was interesting to see the large range of total sulfur dioxide in the wines. We see here that this discrepancy is caused by only a few extreme outliers and so is not very important to investigate.

The graph of total sulfur dioxide is very similar to that of free sulfur dioxide. This is not much of a surprise. To be sure, let’s check the log of this graph as well.

Like the free sulfur graph, there is not much of note with this graph other than a couple of blank levels. In the previous graph we did see some extreme outliers but nearly all wines contained fewer than 150 units of total sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

From the histogram, it appears that a lot of the wines gave less accurate entries for density. Our boxplot shows a normal distribution with some outliers. Let’s adjust the bin width of the first plot to get a better look.

This graph is normal and nearly all wines fall within a small range of 0.99 to 1.005.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.740   3.210   3.310   3.311   3.400   4.010

Like nearly all of the variables, pH is normal and has some outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The majority of wines contain 0.5 to 0.7 levels of sulphates, and very few have more than 1. Yet some of the wines contain over 1.5 sulphates, but they are only a fraction of the population.

Univariate Analysis

What is the structure of your dataset?

The data set contains 1599 red wines each of which has 12 features: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality.

Other observations: All observations with the exception of citric acid are normally distributed. Nearly all of the wines got a middle of the road quality score of 5 or 6. No wines scored above 8 or less than 3. Many of the variables have extreme outliers.

What is/are the main feature(s) of your dataset?

The main feature is quality. My goal is to determine which of the other features of the wines most determine its quality score.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I believe that alcohol, the three types of acids, and residual sugar will have the highest influence on the quality of the wines as they most contribute to its flavor.

Did you create any new variables from existing variables in the dataset?

A wines taste is mainly determined by four factors. Acidity is one of those factors. I created a new variable, total.acidity, by combining fixed.acidity and total.acidity. This variable will be used to represent a wine’s acid profile.

I also created the variable single.quality.buckets which is an ordered factor variable of quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric Acid was the only variable with an odd distribution. The rest of the variables had either normal or skewed right distributions. The data came in tidy form and had no NA values, so no adjustments were necessary.

Bivariate Plots Section

With so many variables, lets look at a correlation matrix to see how variables relate to each other.

Let begin with the factors with the highest correlation to quality. The variables that correlate most with quality are alcohol, sulphates, volatile acidity, and citric acid. Quality has a moderate correlation with alcohol and a weak correlation with sulphates, volatile acidity, and citric acid. Let’s see if the graphs reflect this.

There is far too much over plotting in this graph to make any conclusions.

Adding some noise, we begin to see that quality increases as alcohol does, but let’s see if there is anything we can do to make the relationship more clear.

From this graph, we can see a clear correlation. Quality scores between 3 and 5 have similar levels of alcohol averages, but at quality scores higher than 5 we see that alcohol levels are higher as quality increases.

Let’s now investigate our variable that has the second highest correlation with quality, volatile acids.

Skipping immediately to our box plot we see an immediate pattern, the negative correlation between quality and volatile acidity. According to winefolly.com, acidity is what gives wine its tart and sour taste. It makes sense that too much of this flavor would lower a wine’s score.

In the univariate section, we noticed some outliers for volatile acid. Let’s take a quick look into the wines that had volatile acidity levels higher than 1.

Surprisingly, several of the high volatile acidity wines still received average scores. Unsurprisingly, 3 out of the 10 wines that received the lowest score of 3 had high levels.

Let’s now investigate the variable with the next highest correlation to quality, sulphates.

Median sulphates increase at each incremental level of quality. Mean does as well except between 3 and 4 of quality, but not by much and there aren’t very many wines with at those levels. The general trend shows quality increasing as sulphates increase.

Earlier in the univariate section, we saw some wines with very high sulphate levels above 1.5. The graph above shows that quality increases as sulphates do, but none of the high sulphate wines had enough data points to be represented in the graph above. Let’s see what the graph of those outliers looks like.

It appears that the sulphate outliers buck the trend of increased quality as none of the high sulphate wines scored above 6 and the wine with the highest sulphate level scored an awful 4. There aren’t enough observations with these high numbers to make any conclusions.

Let’s take a look at the variable with the fourth highest correlation to quality, citric acid.

There is a lot more variation here than the previous graphs, but focusing on the averages, We can see citric levels trending higher as quality increases and thus there is a correlation. Wines with higher levels of citric acid received higher quality scores.

We were able to provide visualizations to support the correlation between quality and the four variables with the highest r-squared values.

While I am far from a wine connoisseur, I tend to judge wines primarily by taste. According to winesandvines.com, balance between sweetness, alcohol, acid, and tannin is important to wine quality. Let residual sugar represent sweetness, and total acidity represent acid. We don’t have a good variable to represent tannin, but let’s see how the other three variables relate to each other.

There doesn’t see to be any correlation between the alcohol and total acidity, but I am curious to know if there is any combination between the two variables that helps determine quality.

Again, there doesn’t seem to be much of a relationship between total acidity and sugar, but again, let’s check in the next section if there is are specific combinations between the two that help determine quality.

The graph looks nearly identical to the previous one. I had heard that sugar is added to many wines that have a high alcohol content in order to mask the burning taste. Since residual sugars are the sugars left over after fermentation, it makes sense to see that the wines with the highest residual sugar levels have low levels of alcohol. Maybe residual sugar is a bad representation of sweetness levels in wine as it doesn’t seem to include any added sugar.

There doesn’t seem to be much much of a pattern between the three variables when graphed against each other. However, I would like to explore how the three variables compare also graphed with quality to see if there are any combinations of the three variables lead to better quality scores.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Our feature of interest is quality. I checked it’s relationship to the four variables that had a correlation coefficient of at least 2.0: alcohol, volatile acid, sulphates, and citric acid. When graphing the other features against quality, there was a lot of variance and it was hard to produce a linear approximation. For alcohol, volatile acid, and sulphates graphing the mean of quality produced a relatively linear relationship. For citric acid, a density graph along each level of quality showed the correlation.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Looking at the relationships between the three main flavor attributes, sweetness (residual sugar), alcohol, and acidity (total acidity), I was unable to find any relationships when graphed against each other. I’m hoping to find something interesting when quality is also included in the next section.

What was the strongest relationship you found?

The strongest relationship I found was the positive correlation between alcohol and quality. I am interested to see how this relationship changes in the next section when alcohol is compared to quality at different levels of acidity and sugar.

Multivariate Plots Section

In the previous section we checked the relationships between alcohol, total acidity, and residual sugar, but were unable to find any correlations between them. Let’s begin this section by revisiting those relationships but throwing quality into the mix.

As earlier, let’s begin with the alcohol vs total acidity.

Adding quality to the visualization shows us that wines with high alcohol and total acidity scored better on average than wines with lower levels. However, wines tend to be balanced in that those with the same quality score have either high comparative alcohol levels or high comparative total acidity levels, but not both.

There doesn’t seem to be any pattern to how acidity and sweetness together affect quality. We can see both high and low quality wines at nearly all combination levels of the two variables.

In this graph we can see a clear pattern with quality. Unfortunately, quality only seems to be affected by alcohol. As alcohol level increases so does quality, but as we increase along the x-axis quality stays the same as residual sugar increases. It’s a shame we aren’t provided with total sugar data. I believe that if we had a variable that better represented the sweet flavor profile of wines, we would have gotten more interesting results from the previous two visualizations.

Building Linear Model for Quality

##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = wines)
## m2: lm(formula = I(quality) ~ I(alcohol) + sulphates, data = wines)
## m3: lm(formula = I(quality) ~ I(alcohol) + sulphates + volatile.acidity,
##     data = wines)
## m4: lm(formula = I(quality) ~ I(alcohol) + sulphates + volatile.acidity +
##     citric.acid, data = wines)
##
## ============================================================================
##                          m1            m2            m3            m4
## ----------------------------------------------------------------------------
##   (Intercept)           1.875***      1.375***      2.611***      2.646***
##                        (0.175)       (0.177)       (0.196)       (0.201)
##   I(alcohol)            0.361***      0.346***      0.309***      0.309***
##                        (0.017)       (0.016)       (0.016)       (0.016)
##   sulphates                           0.994***      0.679***      0.696***
##                                      (0.102)       (0.101)       (0.103)
##   volatile.acidity                                 -1.221***     -1.265***
##                                                    (0.097)       (0.113)
##   citric.acid                                                    -0.079
##                                                                  (0.104)
## ----------------------------------------------------------------------------
##   R-squared             0.227         0.270         0.336         0.336
##   adj. R-squared        0.226         0.269         0.335         0.334
##   sigma                 0.710         0.690         0.659         0.659
##   F                   468.267       294.988       268.912       201.777
##   p                     0.000         0.000         0.000         0.000
##   Log-likelihood    -1721.057     -1675.142     -1599.384     -1599.093
##   Deviance            805.870       760.894       692.105       691.852
##   AIC                3448.114      3358.284      3208.768      3210.186
##   BIC                3464.245      3379.793      3235.654      3242.448
##   N                  1599          1599          1599          1599
## ============================================================================

The linear model can only account for 34% of variance, and citric acid didn’t improve that number at all.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There was an interesting interaction between alcohol and total acidity. The feature of interest, quality, would increase as alcohol and acidity increased, but only one of the variables would increase in comparison to the other. It appears as though strong flavors are preferred, but only one strong flavor and not both.

Were there any interesting or surprising interactions between features?

I found it surprising that residual sugar had little to do with quality scores when compared in conjuncture with another flavor variable. Since sweetness is a major component of flavor, which is a major component of wine quality, and since sweetness is added to help balance out bitter, burning from alcohol, and acidic flavors, I was expecting it to have a huge effect on quality.

One possible explanation for this is that residual sugar only represents the sugars left over after fermentation and not necessarily added sugars so the variable we use to approximate the sweetness of a wine is not accurate.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a model using the four variables from the bivariate section.

The model is limited as the four variables accounted for only 34% of variance in quality of the wines. Citric acid did not increase improve the R^2 value at all and could be left out of the model.

Final Plots and Summary

Plot One

Description One

Using these two graphs together gives you a lot of immediate information for your first exposure to each input variable. Right away we can see a steep normal distribution with half of the wines having a pH between 3.2 and 3.4. We can also see the outliers including the extremes having a pH of over 4.

Plot Two

Description Two

At first glance, it was hard to see the relationship between quality and the input variables. This visualization shows lower levels of volatile acidity as quality increases. Between quality level 7 and 8 mean increases very slightly, but despite this the trend is clear. Volatile acidity has a negative correlation with quality.

Plot Three

Description Three

This graph demonstrated that some variables, when together, affected quality differently than when they were on their own. Here, as quality was at its highest when one variable was high, but the other was relatively low. Wines that had high levels of both like the one with nearly 15% alcohol and over 15g/dm^3 of total acidity scored poorly.

Reflection

The wine dataset contained 1599 different red wines. Each wine contained 11 input attributes and one output attribute. My overall objective was to find how the input attributes affected quality scores. I began by looking at each of the variables to understand them. I then began comparing some of the of the variables to quality and saw how they correlated. Some clear trends emerged, especially between alcohol and quality. Comparing some of the input variables against each other gave some surprising results. Residual sugar had little relationship with any of the other variables even when combined against quality. While I didn’t find very strong correlations between quality and the input variables, I still made an attempt to create a model. Unfortunately, the model was only able to account for 34% of variance.

Several limitations to the model exist. To begin with, it’s low R^2 value. Also, quality values were subjective, which means bias. Finally, quality values had a very narrow distribution. None of the wines received scores of 1, 2, 9, or 10; and 82% of the wines had an average score of 5 or 6. This lack of differentiation made finding correlations difficult.

For further investigations, I would look more in depth on if the wines that received high quality scores (7 or 8) had any glaring differences from those that scored in the middle (5 or 6) or low (3 or 4) tier. A larger dataset would be useful, especially one that included some exceptional (9 or 10) and terrible (1 or 2) wines. More variables would be better as well, I would be most interested in knowing the total sugar and tannin levels to get the full scope of wine flavor.